Abstract

All over the world, future parents are facing the task of finding a suitable given name for their child. This choice is 
influenced by different factors, such as the social context, language, cultural background and especially personal taste. 
Although this task is omnipresent, little research has been conducted on the analysis and application of interrelations 
among given names from a data mining perspective. 

The present work tackles the problem of recommending given names by first mining for inter-name relatedness
in data from the Social Web. Based on these results, the name search engine "Nameling" was built, which attracted
more than 35,000 users within less than six months, underpinning the relevance of the underlying recommendation
task. The accruing usage data is then used for evaluating different state-of-the-art recommendation systems as well
as our new NameRank algorithm, which we adapted from our previous work on folksonomies and which yields the
best results, considering the trade-off between prediction accuracy and runtime performance as well as its ability to
generate personalized recommendations. We also show how the gathered inter-name relationships can be used for
meaningful result diversification of PageRank-based recommendation systems.

As all of the considered usage data is made publicly available, the present work establishes baseline results,
encouraging other researchers to implement advanced recommendation systems for given names.

1 Introduction 

The choice of a given name is typically accompanied by an extensive search for the most suitable alternatives, to
which many constraints apply. First of all, the social and cultural background determines what a common name is
and may additionally imply certain habits, such as the patronym. Additionally, most names bear a certain meaning
or associations, which also depend on the cultural context.

Whoever makes the decision is strongly influenced by personal taste and current trends within the social context,
either by preferring names which are currently popular or by avoiding names which will most likely be common in
the neighborhood. Recently, public discussion of psychological effects associated with certain given names, which
may eventually lead to social discrimination of individuals [31], has increased future parents' sense of
responsibility, making the process of finding a given name even more involved.

Future parents are often aided by huge collections of given names which list several thousand names, ordered 
alphabetically or by popularity. With the first author's need for a given name for his child, the idea arose to collect data 
from the "Social Web" in order to derive background information, popularity, interrelations and similarities of given 
names. The search engine "Nameling" [26] utilizes Wikipedia's text corpus for interlinking names and the
microblogging service Twitter for capturing current trends and popularity of given names. Nevertheless, the underlying
rankings G7]| and thus the search results are statically bound to the underlying co-occurrence graph obtained from 
Wikipedia and thus not personalized. 

This work presents the task of recommending given names, applying approaches from distributional semantics and
state-of-the-art recommender systems, such as user based collaborative filtering (UCF), item based collaborative
filtering (ICF) and matrix factorization approaches. Additionally, the recommendation algorithm NameRank is presented,
which is adapted from previous work on folksonomies [12], showing good performance in terms of prediction accuracy
as well as runtime complexity.

The rest of the work is structured as follows: In Sec. 2, inter-name similarities are derived and evaluated. Sec. 3
presents the usage data of the Nameling search engine, which is then used for evaluating name recommendation
approaches. Then, in Sec. 4, different approaches for result diversification are discussed and evaluated.

2 Relatedness of Given Names 

This section aims at discovering relations among given names which can be used to aid future parents in searching
for and finding suitable names by means of ranking and recommendation techniques.

There are many reasons for names being related, ranging from phonetic similarity to common cultural context or
similar "meaning". With the rise of the Social Web, a wealth of data sources has become available, interconnecting users
and object information, either explicitly or implicitly [25]. In particular, given names are linked by social interactions
of the respective holders of the names as well as by occurrences within heterogeneous, mostly unstructured data collections.

Ultimately, by leveraging data from online social networks, microblogging systems and encyclopedic background 
information, recommendation systems for given names may consider the user's social context, cultural habits, personal 
taste and common popularity. In the following, we present a simple, yet powerful approach for mining interrelations 
among given names, which is based on co-occurrence networks. 

Data Sources For building co-occurrence networks of given names, we used the official Wikipedia data dump,
which is freely available for download, and considered the English (date: 2012-01-05), French (2012-01-17) and
German (2011-12-12) versions separately. We additionally used the categorization of the affiliated Wiktionary project
(English, French and German, 2012-06-06), also available for download. As an additional source for user generated
data, we considered the microblogging service Twitter. Using Twitter, each user publishes short text messages (called
"tweets"). We used the data set introduced in [42], which comprises 476,553,560 tweets from 17,069,982 users,
collected from 2009/06 until 2009/12. Some effort was made to build up a comprehensive list of given names. In a
semi-automatic way, a list of more than 30,000 names was collected. During the first months of Nameling's lifetime,
additional names were proposed by users of the system, yielding a list of 36,434 given names.

2.1 Networks of Given Names 

One of the most basic notions of relatedness between two given names can be observed when they occur together in
an atomic context of a given data collection. In the case of Wikipedia, we counted such co-occurrences based on sentences
and for Twitter based on tweets. We thus obtain for each data source $S \in \{EN, DE, FR, Twitter\}$ (English, German and
French Wikipedia as well as Twitter) an undirected weighted graph $G_S = (V_S, E_S)$, where $V_S$ denotes the subset of all
observed names within $S$ and for names $u, v$ there exists an edge $(u, v) \in E_S$ with weight $w(u, v)$ if $u$ and $v$ co-occurred
in exactly $w(u, v)$ contexts. For example, the given names "Peter" and "Paul" co-occurred in 30,565 sentences within
the English Wikipedia. Accordingly, there is an edge (Peter, Paul) in $G_{EN}$ with the corresponding edge weight 30,565.
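The counting step can be sketched as follows. This is a minimal illustration, assuming that the contexts (sentences or tweets) are already split and that a simple letter-based tokenizer is sufficient; the function name `cooccurrence_graph` and the toy sentences are hypothetical and not part of the original system.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_graph(contexts, names):
    """Return edge weights w(u, v): the number of contexts containing both names."""
    known = {name.lower() for name in names}
    weights = Counter()
    for text in contexts:
        # Assumption: a token is a maximal run of letters; each name counts once per context.
        tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
        present = sorted(tokens & known)
        for u, v in combinations(present, 2):
            weights[(u, v)] += 1  # undirected edge, stored with u < v
    return weights


# Toy contexts standing in for sentences or tweets.
sentences = [
    "Peter and Paul met Mary.",
    "Paul wrote to Peter.",
    "Mary stayed home.",
]
graph = cooccurrence_graph(sentences, ["Peter", "Paul", "Mary"])
```

Storing each undirected edge under a sorted pair keeps the weight of (Peter, Paul) and (Paul, Peter) in a single counter entry.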

Table 1 summarizes the high level statistics for all considered co-occurrence networks. As one would expect, all
networks contain a giant connected component [32] which covers almost the whole corresponding node set. The
network obtained from the English Wikipedia is the most densely connected network, whereas the network obtained
from Twitter is the least densely connected.

Table 1: High level statistics for all co-occurrence networks.

Inter-Network Correlation Test For assessing the pairwise structural interdependence of the different networks,
we apply the quadratic assignment procedure (QAP) test [3, 4]. For given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$
with $U := V_1 \cap V_2 \neq \emptyset$ and adjacency matrices $A_i$ corresponding to $G_i|_U$ ($G_i$ reduced to the common vertex set $U$),
the graph covariance is given by

$$\mathrm{cov}(G_1, G_2) := \frac{1}{n^2 - 1} \sum_{j=1}^{n} \sum_{k=1}^{n} \big((A_1)_{jk} - \mu_1\big)\big((A_2)_{jk} - \mu_2\big)$$

where $n := |U|$ and $\mu_i$ denotes $A_i$'s mean ($i = 1, 2$). Then $\mathrm{var}(G_i) := \mathrm{cov}(G_i, G_i)$, leading to the graph correlation

$$\rho(G_1, G_2) := \frac{\mathrm{cov}(G_1, G_2)}{\sqrt{\mathrm{var}(G_1)\,\mathrm{var}(G_2)}}.$$



The QAP test compares the observed graph correlation $\rho_o$ to the distribution of resulting correlation scores obtained
on repeated random row/column permutations of $A_2$. The fraction of permutations $\pi$ with correlation $\rho_\pi \geq \rho_o$ is used
for assessing the significance of an observed correlation score $\rho_o$. Intuitively, the test determines (asymptotically) the
fraction of all graphs with the same structure as $G_2|_U$ having at least the same level of correlation with $G_1|_U$.

Table 2a shows the pairwise correlation scores for all considered networks. The Wikipedia-based co-occurrence
graphs show the strongest correlation. For assessing the significance of the observed correlations, we repeatedly
calculated the pairwise correlations on 1,000 corresponding randomly generated null models. For any pair of the
considered networks, every randomly generated null model showed much lower correlation scores ($< 10^{-3}$), which
indicates statistical significance [3].
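The test procedure can be sketched with NumPy as follows. This is only an illustrative sketch under our own assumptions: the function names, the toy matrices and the fixed random seed are hypothetical, both matrices are assumed to be restricted to the common vertex set already, and neither matrix may be constant. Note that the normalization factor of the covariance cancels in the correlation.

```python
import numpy as np


def graph_correlation(a1, a2):
    """Correlation of two adjacency matrices over all matrix entries."""
    d1 = a1 - a1.mean()
    d2 = a2 - a2.mean()
    return float((d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum()))


def qap_test(a1, a2, permutations=1000, seed=0):
    """Return the observed correlation and the fraction of permutations reaching it."""
    rng = np.random.default_rng(seed)
    observed = graph_correlation(a1, a2)
    n = a1.shape[0]
    hits = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        permuted = a2[np.ix_(perm, perm)]  # relabel rows and columns simultaneously
        if graph_correlation(a1, permuted) >= observed:
            hits += 1
    return observed, hits / permutations


# Toy example: the second graph is a rescaled copy of the first one.
a1 = np.array([[0, 3, 1, 0],
               [3, 0, 2, 0],
               [1, 2, 0, 5],
               [0, 0, 5, 0]], dtype=float)
a2 = 2.0 * a1
rho, p_value = qap_test(a1, a2, permutations=200)
```

Permuting rows and columns with the same permutation preserves the structure of the second graph while destroying the correspondence of its vertices to those of the first graph.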

We conclude that the co-occurrence networks structurally correlate. Nevertheless, language specific deviations
exist. For discovering relations on named entities, the corresponding language should therefore be considered. In the
next section we will investigate how structural similarity within the co-occurrence networks correlates with natural
notions of relatedness among given names.



2.2 Mining for Relations from the Social Web 

In this section, we focus on the question of whether structural similarity in the co-occurrence networks from Section 2.1
gives rise to a notion of relatedness which implies relationships the user might be interested in. For evaluating and
comparing different similarity metrics, we need a "reference" notion of relatedness for the considered names to be used
as "ground truth". For given names, there is no generally accepted reference relation. We therefore apply the approach
of using an external data source which we assume to be a valid "gold standard". We argue that the categories assigned to
names in Wiktionary are a good basis, as they are manually assigned and have a direct connection to concepts users
associate with given names (such as gender and cultural context). We finally chose cosine similarity, which is broadly
accepted for various applications, for calculating a reference similarity score. For the sake of a concise presentation,
we restrict our analysis in this section to the networks obtained from the English Wikipedia and from Twitter.



Table 2: Left: Pairwise graph correlation observed in the co-occurrence graphs. Right: Basic statistics for the different
activities within the Nameling.

Vertex Similarities Subsequently, we only consider the two similarity functions which yielded the most promising
results. For the discussion and evaluation of a broader range of similarity metrics, refer to [29]. The Jaccard coefficient
measures the fraction of common neighbors. For weighted networks, we also consider its weighted variant [30]. The
cosine similarity measures the cosine of the angle between the corresponding rows of the adjacency matrix, where we
consider both its weighted and unweighted variant.
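Both vertex similarities can be sketched on an adjacency matrix as follows. The weighted Jaccard variant shown here (sum of element-wise minima over sum of element-wise maxima) is one common generalization and is an assumption on our side; it is not necessarily the variant defined in [30]. Function names and the toy matrix are hypothetical.

```python
import numpy as np


def _rows(a, u, v, weighted):
    """Return the adjacency rows of u and v, binarized in the unweighted case."""
    x, y = a[u], a[v]
    if not weighted:
        x, y = (x > 0).astype(float), (y > 0).astype(float)
    return x, y


def cosine_similarity(a, u, v, weighted=True):
    x, y = _rows(a, u, v, weighted)
    norm = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / norm) if norm > 0 else 0.0


def jaccard(a, u, v, weighted=True):
    x, y = _rows(a, u, v, weighted)
    denom = np.maximum(x, y).sum()
    # Unweighted case: |common neighbors| / |union of neighbors|.
    return float(np.minimum(x, y).sum() / denom) if denom > 0 else 0.0


# Toy weighted co-occurrence matrix for four names.
a = np.array([[0, 2, 1, 0],
              [2, 0, 1, 0],
              [1, 1, 0, 4],
              [0, 0, 4, 0]], dtype=float)
jaccard_unweighted = jaccard(a, 0, 1, weighted=False)
cosine_unweighted = cosine_similarity(a, 0, 1, weighted=False)
```

In the toy matrix, names 0 and 1 share exactly one of three distinct neighbors, which is what the unweighted Jaccard coefficient measures.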

For obtaining a reference relation on the set of given names, we collected all corresponding category assignments 
from Wiktionary. We thus obtained for each of 10,938 given names a respective binary vector, where each component 
indicates whether the corresponding category was assigned to it (in total 7,923 different categories and 80,726 non-zero 
entries). We loosely denote these category assignments as "semantic" properties and accordingly the induced similarity
as "semantic similarity" of names. 

Neighborhood & Similarity As a first analysis of the interdependence of a name's position within the co-occurrence
network and its category assignment in Wiktionary, we consider for pairs of names their respective shortest path
distance relative to their reference similarity. Considering the co-occurrence networks obtained from Wikipedia and
Twitter separately, we calculate for every shortest path distance d the average similarity score over all
pairs of names u, v with shortest path distance d. To rule out statistical effects, the same calculations are repeated
on a null model in which the correspondence of names within the network and the category assignment is randomly shuffled. Fig. 1
shows the results for the cosine similarity together with the Jaccard coefficient. In both networks, the similarity of
node pairs tends to decrease monotonically with the respective shortest path distance, where direct neighbors are on
average more similar than randomly chosen pairs (referring to the null model baseline) and pairs at distance two are
already less similar than expected by chance.

Structural & Semantic Similarity The interplay of shortest path distance and semantic similarity of names
indicates that the structural context of a name within the co-occurrence network is correlated with its semantic properties.

We now compare different vertex similarity metrics with respect to their ability to capture semantic similarity
of names. In detail, we calculate for any pair u, v of names in the co-occurrence network (which have a category
assignment) the cosine similarity COS(u, v) based on the respective category assignment vectors as well as any of
the considered vertex similarity metrics s(u, v). As the number of data points (COS(u, v), s(u, v)) grows
quadratically with the number of names, we grouped the co-occurrence based similarity scores in 1,000 equidistant bins and
calculated for each bin the average cosine similarity based on category assignments. Figure 2 shows the results for
Wikipedia and Twitter separately.

Notably, all considered similarity metrics capture a positive correlation between similarity in the co-occurrence
network and similarity between category assignments to names. However, significant differences between the underlying
co-occurrence networks and the applied similarity functions can be observed. As for Wikipedia, the cosine similarity
shows similar characteristics in its weighted and unweighted variant, but for higher structural similarity scores the
weighted variant is more consistent with the semantic similarity of names. Considering the Jaccard coefficient, the
unweighted variant is more consistent with the reference similarity than its weighted variant.

As for Twitter, both cosine similarity and the Jaccard coefficient are more consistent with the reference similarity 
for lower and average structural similarity scores. For higher structural similarity, the weighted cosine similarity shows 
strong correlations with the reference relation. 

Summing up, edge weights without further pre-processing may significantly decrease the performance of a
similarity metric with respect to its ability to capture semantic similarity of names. Furthermore, no clearly best structural
similarity can be deduced from the conducted experiments. Nevertheless, correlations between structural and semantic
similarity can be observed for all considered vertex similarity functions.

Figure 1: Semantic relatedness vs. shortest path distance in the co-occurrence networks.

Figure 2: Similarity based on name categories in Wiktionary vs. vertex similarity in the co-occurrence networks
(weighted and unweighted).

3 Recommending Given Names 

The results presented in the previous section indicate correlations between semantic relatedness among given names 
and structural similarity within co-occurrence networks obtained from different data sources. 

We now turn our focus towards systems for personalized name recommendations, leveraging the usage data which
accrued in Nameling's log files. We first summarize the considered usage data and analyze the emerging networks
of given names. We then briefly summarize different state-of-the-art recommendation systems and present our own
recommendation approach. All considered recommendation systems are finally evaluated and compared with respect
to their prediction accuracy.

3.1 Preliminaries 

Throughout this section, we use $u, v$ and $i, j$ as placeholder or index variables to denote users and
items (i. e., names), respectively. We denote the set of users with $\mathcal{N}$ and the set of items with $\mathcal{M}$, denoting the number
of users with $n := |\mathcal{N}|$ and the number of names with $m := |\mathcal{M}|$.

The context of our work is given by Nameling, a search and recommendation system for given names, where
a user $u$ might express interest in a name $i$ by entering the name directly in a search mask, clicking on a name
or adding a name to the personal list of favorite names. We denote the user's expressed affiliation with $i$ by $r^{Enter}_{ui}$, $r^{Click}_{ui}$ and
$r^{Favorite}_{ui}$, respectively. All user-name affiliations are aggregated in the corresponding binary user-name matrices $R^{Enter}$,
$R^{Click}$ and $R^{Favorite}$. Whenever the corresponding activity class is irrelevant or determined by the context, we drop the
superindex. Furthermore, we consider the set $\mathcal{M}_u := \{j \in \mathcal{M} \mid r_{uj} > 0\}$ of all names $j$ which user $u$ is affiliated
with as well as the set $\mathcal{N}_i := \{v \in \mathcal{N} \mid r_{vi} > 0\}$ of all users $v$ who are affiliated with the $i$th given name.

Please note that we consider binary user-name affiliations, i. e., $r_{ui} \in \{0, 1\}$, where a zero value may indicate
that user $u$ is not interested in $i$, has not yet expressed her/his affiliation or is just unaware of $i$. A positive value of $r_{ui}$,
on the other hand, clearly indicates that $u$ likes $i$ only in the case of adding a name to the personal list of favorites. The
reason for entering or clicking on a name might also just be curiosity.

Recommending given names for a user $u$ based on a given user-name matrix $R$ corresponds to predicting $u$'s
affiliation with any name $i \in \mathcal{M} \setminus \mathcal{M}_u$, which we denote with $\hat{r}_{ui}$. Most recommendation systems determine for each
such user-item pair a score which reflects the recommender's confidence about a positive affiliation. Recommending $k$
names for $u$ is achieved by taking the top $k$ names from the thus ordered list of names. For a recommendation system
Rec, we denote the ordered list of $k$ recommended names for user $u$ with $\mathrm{Rec}_k(u)$ and write $\mathrm{Rec}(u)$ to denote Rec's
ranking of all names for $u$, based on the recommender's scoring function.



For assessing the prediction accuracy of a recommendation system, we process each user $u$ separately and select
for evaluation a subset $\mathrm{Test}(u) \subseteq \mathcal{M}_u$. Recommendations for $u$ are then calculated on $\tilde{R}$, which is element-wise given
by

$$\tilde{r}_{vj} := \begin{cases} 0, & \text{if } v = u \text{ and } j \in \mathrm{Test}(u) \\ r_{vj}, & \text{otherwise.} \end{cases}$$



We consider several metrics for scoring a recommendation system's prediction accuracy: 

Precision/Recall Precision and recall are metrics originating from the evaluation of information retrieval systems,
given by

$$\mathrm{Precision}_k(u) := \frac{|\mathrm{Rec}_k(u) \cap \mathrm{Test}(u)|}{k}, \qquad \mathrm{Recall}_k(u) := \frac{|\mathrm{Rec}_k(u) \cap \mathrm{Test}(u)|}{|\mathrm{Test}(u)|}.$$

We also write Precision@k and Recall@k to denote the average scores over all users $u \in \mathcal{N}$.

Average Precision/MAP The mean average precision is a metric for obtaining a single value performance score for
a recommendation system Rec considering all recommended names for all users:

$$\mathrm{MAP} := \frac{1}{n} \sum_{u=1}^{n} \mathrm{AveP}(u)$$

where the average precision is given by

$$\mathrm{AveP}(u) := \frac{1}{|\mathrm{Test}(u)|} \sum_{k=1}^{m} \big[\mathrm{Precision}_k(u) \cdot \delta_k(u)\big]$$

and $\delta_k(u) = 1$ iff the recommended element at rank $k$ is a relevant name (that is, $\mathrm{Rec}(u)[k] \in \mathrm{Test}(u)$), and $\delta_k(u) = 0$ otherwise. Refer to [40]
for more details.
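For a single user, these metrics can be sketched as follows. The sketch assumes that the average precision is normalized by the number of relevant names in the test set, which is the common convention but should be treated as an assumption here; the function names and the toy ranking are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k recommended names that are relevant."""
    return len(set(ranking[:k]) & relevant) / k


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant names found among the top k recommendations."""
    return len(set(ranking[:k]) & relevant) / len(relevant)


def average_precision(ranking, relevant):
    """Sum Precision@k over all ranks k holding a relevant name, then normalize."""
    hits, total = 0, 0.0
    for k, name in enumerate(ranking, start=1):
        if name in relevant:
            hits += 1
            total += hits / k  # Precision@k at a rank with a relevant name
    return total / len(relevant)  # assumption: normalize by |Test(u)|


# Toy ranking for one user; the relevant set plays the role of Test(u).
ranking = ["anna", "ben", "clara", "david", "emma"]
relevant = {"anna", "clara", "emma"}
ap = average_precision(ranking, relevant)
```

MAP is then obtained by averaging `average_precision` over all users.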

Normalized Discounted Cumulative Gain The basic idea of the discounted cumulative gain metric (DCG) is that
relevant items appearing further down the ranking should be penalized. Typically, the penalization factor scales inverse
logarithmically with the position in the list of recommended items:

$$\mathrm{DCG}_k(u) := \sum_{i=1}^{k} \frac{2^{\delta_i(u)} - 1}{\log_2(i + 1)}$$

where $\delta_i(u)$ is the relevance indicator for rank $i$ as used for the average precision. For normalizing the DCG score
across recommendations for different users, the DCG score is considered relative to
the best possible DCG score for the considered user, called the ideal score IDCG, resulting in

$$\mathrm{NDCG}_k(u) := \frac{\mathrm{DCG}_k(u)}{\mathrm{IDCG}_k(u)}.$$

For maximal $k$ we drop the index and write NDCG. Please note that we restrict our discussion to the binary case.
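In the binary case, the gain of a relevant name is 1 and the computation reduces to the following sketch. It assumes that the ideal ranking places all relevant names at the top of the list; function names and toy data are hypothetical.

```python
import math


def dcg_at_k(ranking, relevant, k):
    """Binary DCG: every relevant name at rank i contributes 1 / log2(i + 1)."""
    return sum(1.0 / math.log2(i + 1)
               for i, name in enumerate(ranking[:k], start=1)
               if name in relevant)


def ndcg_at_k(ranking, relevant, k):
    """DCG relative to the ideal ranking with all relevant names placed first."""
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg_at_k(ranking, relevant, k) / ideal if ideal > 0 else 0.0


ranking = ["anna", "ben", "clara", "david", "emma"]
relevant = {"anna", "clara", "emma"}
dcg3 = dcg_at_k(ranking, relevant, 3)
ndcg3 = ndcg_at_k(ranking, relevant, 3)
```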
Test of Significance For comparing the prediction accuracy of different recommendation systems, typically the
average of some evaluation metric is compared. Due to the sensitivity of the average towards outliers, such a comparison
may often lead to erroneous conclusions. For further comparison, the distribution of the respective evaluation
metric should be taken into account too, e. g., in order to apply some statistical test for assessing the significance of the
observed difference.

We apply the simple sign test for assessing whether a recommendation system A significantly outperforms
recommendation system B, relative to some per user evaluation metric. For this purpose, we count the number $n_A$ of
users where A yields better scores than B and, inversely, $n_B$. We thereby ignore all users with equal evaluation scores.
The sign test assesses the significance $p$ of an observed outcome $n_A$ and $n_B$ by estimating the probability that A is not
truly better than B with the probability that at least $n_A$ out of $n := n_A + n_B$ fair Bernoulli trials succeed:

$$p = \left(\frac{1}{2}\right)^{n} \sum_{i=n_A}^{n} \frac{n!}{i!\,(n - i)!}$$

Please refer to [36, 6] for further details and more advanced hypothesis tests for comparing recommendation systems.
Please note that we only assess the significance of selected results in order to reduce the impact of correcting for
multiple testing (e. g., Bonferroni correction).
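The sign test can be sketched with the Python standard library as follows. The per user score lists are toy values and the function name is hypothetical; the sketch assumes that both lists are aligned by user.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """One-sided sign test on paired per user scores; ties are ignored."""
    n_a = sum(a > b for a, b in zip(scores_a, scores_b))
    n_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = n_a + n_b
    # Probability of at least n_a successes in n fair Bernoulli trials.
    p = sum(comb(n, i) for i in range(n_a, n + 1)) / 2 ** n
    return n_a, n_b, p


# Toy per user metric values for two recommenders A and B.
scores_a = [0.5, 0.4, 0.9, 0.3, 0.7, 0.2]
scores_b = [0.1, 0.4, 0.6, 0.2, 0.5, 0.3]
n_a, n_b, p = sign_test(scores_a, scores_b)
```

Here A wins for four users, B for one and one user is tied, so the p-value is (5 + 1) / 32.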



3.2 Usage Data 

For our analysis and evaluation, we considered Nameling's activity log entries within the time range from 2012-03-06
until 2012-08-10. In the following, we first describe the collected usage data, analyze properties of emerging
network structures and finally compare interrelations between the different networks.

In total, 38,404 users issued 342,979 search requests. Subsequently, we differentiate between activities where a
user manually entered a given name ("Enter"), followed a link to a name within a result list ("Click") or added a
given name to the list of favorite names ("Favorite"). Table 2b summarizes high level statistics for these activity classes,
showing, e. g., that 35,684 users entered 16,498 different given names.

For analyzing how different users contribute to Nameling's activities, Figure 3 shows the distribution of activities
over the set of users, separately for Enter, Click and Favorite requests. Clearly, all activities exhibit
long tailed distributions, that is, most users entered fewer than 20 names but there are also users with more than 200
requests.

In order to assess the interdependence of name contexts established by search queries within Nameling and
co-occurrences within Wikipedia, we further constructed for each activity class $C \in \{Enter, Click, Favorite\}$ a
corresponding weighted projection graph $G_C$, where an edge $(i, j)$ exists with weight $k$ if $k$ users searched for both names
$i$ and $j$. Table 3 summarizes various high level network statistics for all considered projection graphs. All projection
graphs encompass a giant connected component [32], giving rise to a notion of relatedness among names.

Please note that these usage graphs themselves can be used for calculating similarity among given names, e. g., by
applying the same similarity metrics as discussed for the co-occurrence graphs.
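The construction of such a projection graph can be sketched as follows, assuming the log of one activity class is available as a mapping from user identifiers to the names they interacted with. The function name and the toy log are hypothetical.

```python
from collections import Counter
from itertools import combinations


def projection_graph(user_names):
    """Edge (i, j) gets weight k if k users are affiliated with both i and j."""
    weights = Counter()
    for names in user_names.values():
        # Each user contributes at most once per pair of names.
        for i, j in combinations(sorted(set(names)), 2):
            weights[(i, j)] += 1
    return weights


# Toy log for one activity class (e.g. Enter).
logs = {
    "u1": ["emma", "anna", "noah"],
    "u2": ["anna", "emma"],
    "u3": ["noah"],
}
g_enter = projection_graph(logs)
```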
Figure 3: Number of users per query count for Enter, Click and Favorite activities.

Table 3: High level statistics for all considered projection graphs with the number of weakly connected components
#wcc and largest weakly connected components lwcc.

For assessing the interdependence of the name contexts established by the different co-occurrence networks from
Sec. 2.1 and those emerging from Nameling's query logs, we again apply the quadratic assignment procedure test
(cf. Sec. 2.1). Table 4 shows the pairwise correlation scores for all considered networks. Again, correlations are
more pronounced within a system (e. g., 0.466 for $G_{Enter}$ and $G_{Click}$). However, the networks
obtained from Nameling exhibit significantly higher graph level correlation scores with the co-occurrence network obtained
from the German Wikipedia. This indicates that the dominance of German users has an impact on the name
contexts emerging within the search based networks, which are more closely related to the corresponding language dependent network
structure within the co-occurrence network from Wikipedia. Finally, the correlation scores for the Twitter based
co-occurrence network show the least magnitude, though they are still significant for $G_{Enter}$ and $G_{Favorite}$ in terms of the QAP test
($p < 0.05$).

Summing up, we conclude that users accessing a name search engine like Nameling are implicitly establishing
name contexts which differ from those obtained from encyclopedic data sources such as Wikipedia. These user
centric name contexts are more likely to reflect the user's taste and personal preference and accordingly may be used
for generating personalized name recommendations.



3.3 The Recommendation Task 

The choice of the given name is one of the first important decisions future parents have to make. Many influencing
factors have an impact on this decision process, such as cultural background, current trends and personal taste.
Typically, large collections of given names with additional background information for the corresponding names are
at hand, either in the form of a lexicographical book or a specialized web site. Whereas a search engine for given names
allows the user to browse through the list of names ordered by some notion of relevance, a recommendation system
suggests a small personalized list of names which the user might be interested in.

Different constraints can apply to the list of recommended names, notably depending on the amount of background
information available. Given names from the user's ego-network of some on-line social network, e. g., could be ignored
(assuming the user knows the given names of her or his friends), or a certain diversity of names may be requested, as it is
not desirable to present the user, e. g., just different variants of a single name. Each such constraint influences the
assessment of the quality of a recommendation system. Accordingly, there is no single valid evaluation protocol.
Below, we present the evaluation protocol we applied for establishing baseline results for the quality of recommendation
systems for given names.



Table 4: Graph level correlations for all pairs of considered networks. All observed correlations are significant
according to the quadratic assignment procedure (QAP).

Evaluation Protocols The evaluation of a recommendation system is strictly dependent on the targeted
objective [36]. Ultimately, only an appropriate comparative live evaluation of different recommendation systems can
comprehensively assess the performance of such systems. In this section, we focus on the prediction accuracy relative to
a publicly available activity log file from Nameling as presented in Section 3.2.

For this purpose, we process each user $u$ separately and remove from the set of entered search names $\mathrm{Enter}(u)$
a subset $\mathrm{Test}(u)$ for evaluation. We then use all activity data without the selected evaluation data for calculating
recommendations and assess the prediction accuracy relative to $\mathrm{Test}(u)$. We applied different strategies for selecting
the evaluation set $\mathrm{Test}(u)$, each of which is suitable for examining certain research questions:

TakeKIn: $\mathrm{Test}(u)$ is randomly sampled with size $|\mathrm{Enter}(u)| - k$.
LeaveKOut: $\mathrm{Test}(u)$ is randomly sampled with size $k$.

TakeFirstIn: $\mathrm{Test}(u)$ consists of the last $|\mathrm{Enter}(u)| - k$ names the user $u$ has entered. Name recommendations are
therefore based on the first $k$ names.

LeaveLastOut: $\mathrm{Test}(u)$ consists of the last $k$ names the user $u$ has entered.

Whenever results for different numbers of known names $k < N_{max}$ are compared, we additionally removed
$N_{max} - k$ names from the user's test set $\mathrm{Test}(u)$ to ensure that results for different values of $k$ are comparable and not
determined through varying sizes of $\mathrm{Test}(u)$.
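The selection strategies can be sketched as simple list operations. The sketch assumes that the entered names of a user are given as a list of distinct names in chronological order and that k is at least 1 and smaller than the list length; the function names are hypothetical.

```python
import random


def leave_k_out(entered, k, rng):
    """LeaveKOut: sample k test names at random, keep the rest for training."""
    test = set(rng.sample(entered, k))
    train = [name for name in entered if name not in test]
    return train, test


def take_k_in(entered, k, rng):
    """TakeKIn: keep k random names for training, test on the remaining ones."""
    return leave_k_out(entered, len(entered) - k, rng)


def take_first_in(entered, k):
    """TakeFirstIn: train on the first k names, test on all later ones."""
    return entered[:k], set(entered[k:])


def leave_last_out(entered, k):
    """LeaveLastOut: test on the last k names (requires k >= 1)."""
    return entered[:-k], set(entered[-k:])


entered = ["anna", "ben", "clara", "david", "emma"]
train_random, test_random = leave_k_out(entered, 2, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the random splits reproducible across evaluation runs.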

Please note that we only used the set of directly entered names (Enter) for evaluation and did not use the
combination of Enter and Favorite events for evaluation, as these interrelations among names have a strong bias induced by
the ranking function which was implemented in Nameling [26].

Evaluation Metrics Various metrics for assessing the prediction accuracy of recommendation systems exist, and the
choice of the applied metric depends, among other things, on the application context. Firstly, global metrics like MAP and
NDCG summarize the prediction performance over all recommended items, whereas prefix metrics like Precision@k
and NDCG@k only consider the first k recommended items.

For the present work, we differentiate the recommendation task from searching by restricting recommendations
to the corresponding top k items, favoring Precision@k and NDCG@k, where the latter accounts for the
ordering of the recommended items within the top k positions. For reference though, we also consider MAP and
NDCG. When comparing different recommendation systems based on the corresponding MAP scores, special care
has to be taken if one of the recommendation systems fails to generate rankings for all items. Consider, e. g., the
PageRank based NameRank and user based collaborative filtering UCF. By construction, NameRank assigns weights
to all names, whereas UCF cannot infer weights for names which are not connected to one of the queried user's search
names. For NameRank, relevant names are therefore considered for calculating the average precision score even if they
received very low weights, which may degrade MAP significantly (due to the sensitivity of the average towards outliers).

We argue that the fairest handling of such situations is to virtually place all $\ell$ relevant names for a user $u$ which a
recommendation system Rec failed to recommend at the end of $\mathrm{Rec}(u)$.

3.4 Experimentation 

This section focuses on the evaluation of the prediction accuracy of different recommendation systems. On the one
hand, the obtained results are a basis for deciding which recommendation system should be used for recommending
given names in a corresponding application. On the other hand, we thereby establish baseline results for the task of
recommending given names which serve as a reference for developing and evaluating new approaches.

3.5 Applied Recommender Systems 

Subsequently, we briefly summarize each considered recommendation system and introduce corresponding abbreviations
which we henceforth use for identifying individual recommendation systems.

User based CF (UCF): Adapting the weighted sum approach for user based collaborative filtering [38] to the binary
case (a user either searched or did not search for a given name), we set

$$\mathrm{UCF}(u, i) := \sum_{v \in \mathcal{N}_i} \mathrm{SIM}(u, v).$$

For the nearest neighbor approach, only the top $N$ similar users to $u$ are considered in the summation.

For recommending $k$ names to user $u$, the top $k$ names, ordered descending by $\mathrm{UCF}(u, *)$, are determined
(ignoring names $j$ with $r_{uj} > 0$).

Item based CF (ICF): Adapting the weighted sum approach in [34] to the binary case, we set: 

For recommending k names to user u, the top k names ordered descending by ICF(u, *) are determined (ignoring 
names j with $r_{uj} > 0$). 

PPR/PPR+ The preferential PageRank similarity is based on the well known PageRank [ 1 1 algorithm. For an $m \times m$ 
column stochastic adjacency matrix A, damping factor $\alpha$ and uniform preference vector $p := (1/m, \ldots, 1/m)$, 
the global PageRank vector $w =: \mathrm{PR}$ is given as the fixpoint of the following equation: 

\[ w = \alpha A w + (1 - \alpha) p \]

In case of the preferential PageRank for a given set of nodes I, only the corresponding components of the 
preference vector are set, and we accordingly set PPR(I) to the fixpoint of the above equation with 

\[ p_i := \begin{cases} \frac{1}{|I|}, & \text{if } i \in I \\ 0, & \text{otherwise.} \end{cases} \]

As a new item based recommendation approach we propose NameRank (abbreviated as PPR+ for brevity), 
which is an adaptation of the idea presented in [12], where the global PageRank score PR is subtracted from the 
preferential PageRank score in order to reduce frequency effects. We set 

\[ \mathrm{PPR+}(I) := \mathrm{PPR}(I) - \mathrm{PR}. \]

For recommending k names to user u, we calculate $w := \mathrm{PPR+}(M_u)$ on the column stochastic adjacency 
matrix derived from $G_{Enter}$ (cf. Sec. 2) and recommend the top k names ordered descending by w, thereby 
ignoring names j with $r_{uj} > 0$. 

Please note that we show results obtained by averaging the rankings from individual query names, i. e., 

\[ \frac{1}{|I|} \sum_{i \in I} \mathrm{PPR+}(\{i\}), \]

which showed the same results as PPR+(I). 

WRMF The weighted matrix factorization method [13] is designed to deal with implicit feedback scenarios like 
our evaluation setting described in Section 3.3. The observed user-item interactions $r_{ui}$ are interpreted as indicators of 
user u's preference for item i, which is associated with a certain level of confidence depending on the number 
of observed interactions. Unobserved interactions $\hat{r}_{ui}$ are predicted by building a factor model of the user-item 
matrix R via a regularized learning method with alternating least squares optimization. 

MostPopular The MostPopular recommendation approach predicts for all users the most popular names, i. e., the top 
k names ordered decreasing by frequency. 

Random The Random recommendation approach randomly selects k names which the user has not searched before. 
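
To illustrate the binary weighted sum behind the user based approach, the following Python sketch scores a name by summing the similarities of all other users who searched for it. It is only a sketch under stated assumptions: the Jaccard coefficient stands in for SIM (our experiments used the Log-Likelihood similarity), and the names `jaccard`, `ucf_scores`, `recommend` and the toy search logs are hypothetical.

```python
def jaccard(a, b):
    """Placeholder similarity between two sets of searched names (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ucf_scores(user, searches, n_neighbors=None, sim=jaccard):
    """Binary weighted sum: a name's score is the sum of SIM(user, v)
    over all (top N) users v who searched for that name."""
    own = searches[user]
    sims = {v: sim(own, names) for v, names in searches.items() if v != user}
    neighbors = sorted(sims, key=lambda v: (-sims[v], v))
    if n_neighbors is not None:
        neighbors = neighbors[:n_neighbors]   # nearest neighbor variant
    scores = {}
    for v in neighbors:
        for name in searches[v]:
            if name not in own:               # ignore names the user already searched
                scores[name] = scores.get(name, 0.0) + sims[v]
    return scores


def recommend(scores, k):
    """Top k names ordered descending by score (ties broken alphabetically)."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


searches = {
    "u": {"emma", "anna"},
    "v": {"emma", "anna", "lena"},
    "w": {"emma", "mia"},
    "x": {"paul"},
}
top_two = recommend(ucf_scores("u", searches), 2)
```

The item based variant follows the same pattern with the roles of users and names exchanged.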

The item-based and user-based collaborative filtering as well as the MostPopular approaches were implemented 
using the machine learning library Mahout. We evaluated various similarity metrics, showing here results for the 
Log-Likelihood similarity [7], which outperformed the others. The PageRank-based approaches PPR and PPR+ are 
implemented using the Eigen library for matrix operations. For WRMF we applied the implementation of the 
MyMediaLite [9] library with the corresponding default parametrization. 
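
The PageRank based scores can be sketched in a few lines of plain Python. This is a hedged illustration, not our Eigen based implementation: the damping factor 0.85, the fixed number of power iterations, the helper names and the toy weight matrix `W` are all assumptions. Entry `W[r][c]` is read as the weight of the edge from node c to node r.

```python
def column_stochastic(weights):
    """Normalize every column to sum to one (uniform column if a node has no edges)."""
    m = len(weights)
    sums = [sum(weights[r][c] for r in range(m)) for c in range(m)]
    return [[weights[r][c] / sums[c] if sums[c] > 0 else 1.0 / m
             for c in range(m)] for r in range(m)]


def pagerank(A, p, alpha=0.85, iterations=200):
    """Power iteration for the fixpoint of w = alpha * A * w + (1 - alpha) * p."""
    m = len(A)
    w = list(p)
    for _ in range(iterations):
        w = [alpha * sum(A[r][c] * w[c] for c in range(m)) + (1 - alpha) * p[r]
             for r in range(m)]
    return w


def ppr(A, query, alpha=0.85):
    """Preferential PageRank: preference mass 1/|I| on the query nodes only."""
    p = [1.0 / len(query) if i in query else 0.0 for i in range(len(A))]
    return pagerank(A, p, alpha)


def ppr_plus(A, query, alpha=0.85):
    """PPR+ (NameRank): preferential PageRank minus the global PageRank."""
    m = len(A)
    global_pr = pagerank(A, [1.0 / m] * m, alpha)
    return [x - y for x, y in zip(ppr(A, query, alpha), global_pr)]


def recommend_names(A, known, k, alpha=0.85):
    """Top k nodes ordered descending by PPR+, ignoring the known names."""
    w = ppr_plus(A, known, alpha)
    candidates = [i for i in range(len(A)) if i not in known]
    return sorted(candidates, key=lambda i: -w[i])[:k]


W = [[0, 2, 1, 0],
     [2, 0, 1, 0],
     [1, 1, 0, 3],
     [0, 0, 3, 0]]
A = column_stochastic(W)
```

Because the fixpoint is linear in the preference vector, averaging the PPR+ vectors of the individual query names gives the same result as one PPR+ run on the whole query set, which is why both variants produce identical rankings.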

We also considered the Bayesian personalized ranking approach for matrix factorization [33], additionally with 
soft-margin optimization as proposed in ATI, as implemented in the MyMediaLite recommendation library. However, the 
obtained results showed worse prediction performance than the simple MostPopular recommendation approach and are 
therefore excluded from the presentation below for clarity. 

Evaluation Data For evaluating the prediction accuracy of the considered recommendation systems, we applied the 
different evaluation protocols described in Section 3.3 for $k = 1, \ldots, N_{max}$ with $N_{max} := 10$. To ensure that for every 
user u there are at least five names available for testing and prediction, respectively, we restricted our evaluation to 
users who entered more than 14 different names (that is, $|Enter(u)| \geq 15$), leaving 1,230 users for evaluation. Please 
note that we considered neither the set Click(u) of names which the user clicked on nor the set Favorite(u) of names which the 
user added to the list of favorite names, as these data sets are strongly biased by the ranking which was implemented 
in Nameling's search back-end. 



3.6 Results 

We now present our results on the prediction accuracy of the considered recommendation systems. We first present 
the results for the TakeKIn and LeaveKOut experiments, where the results are obtained relative to the number of 
names, and finally the results of the TakeFirstIn and LeaveLastOut experiments, where results are obtained relative to 
the chronologically ordered names of a user's search history. 



TakeKIn This experiment aims at comparing the impact of personalization on the prediction accuracy of the different 
recommendation systems. In particular we want to examine... 

• ...whether and to which extent the performance of a recommendation system benefits from an increasing number 
of known names (i. e., increasing k). 

• ...whether for new users with only few known names, different recommendation systems perform better than for 
those users with a large search history. 

For this purpose, we applied the TakeKIn evaluation protocol as described in Section 3.3. For eliminating the influence 
of certain selections of known names, we repeatedly sampled (25 times) for each user the set of k known names and 
averaged the resulting evaluation scores. Furthermore, we removed $N_{max} - k$ names from the user's test set Test(u) to 
ensure that results for different numbers of known names k are comparable and not determined by varying sizes 
of Test(u) for different k. 
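
The sampling scheme just described can be sketched as follows. This is only an illustration under assumptions: the full protocol is defined in Section 3.3, and the helper names, the seed handling and the callable `recommender` (which maps the known names to a ranked list) are hypothetical.

```python
import random


def take_k_in_split(history, k, n_max=10, rng=None):
    """Sample k known names; the test set is the rest minus (n_max - k) names,
    so its size does not depend on k."""
    rng = rng or random.Random()
    names = sorted(history)
    rng.shuffle(names)
    known = names[:k]
    test = names[n_max:]      # equals names[k:] with n_max - k names removed
    return known, test


def precision_at(recommended, test, n=5):
    """Fraction of the top n recommended names contained in the test set."""
    test_set = set(test)
    return sum(1 for name in recommended[:n] if name in test_set) / n


def evaluate_take_k_in(history, k, recommender, repetitions=25, n_max=10, seed=0):
    """Average Precision@5 over repeated random selections of the known names."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        known, test = take_k_in_split(history, k, n_max, rng)
        scores.append(precision_at(recommender(known), test))
    return sum(scores) / len(scores)
```

With at least 15 entered names per user and n_max = 10, every split leaves at least five test names, independent of k.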

Figure 4 shows the obtained results for Precision@5, NDCG@5 and MAP. Please note that we did not include the 
random baseline results in Figure 4 for clarity, as all corresponding evaluation scores were two orders of magnitude below 
the worst result of the other recommenders. 

We begin with a discussion of some general observations. Firstly, all obtained performance scores are low in 
magnitude, indicating that given names are hard to predict based on the user's search history. One of the key difficulties 
in such implicit feedback data is the indistinguishable intent of search requests [13]: A user might search for a 
name because the user likes the name, or the user might search for names he doesn't like and just want to explore the 
name's neighborhood. Equally, the fact that a user did not search for a certain name might express that the user 
dislikes the name, was not aware of the name, or just did not search for the given name until now but eventually will. 

[Figure 4 with three Take-K-In panels: (a) Precision@k, (b) NDCG@k, (c) MAP; plotted systems: PPR, PPR+, Item CF, 
User CF, Popular, WRMF] 

Figure 4: Prediction accuracy of the different recommendation systems, relative to varying number of known names 
per user. For clarity of presentation the random baseline was omitted, which lies strictly below $10^{-3}$ for all metrics. 

Figure 5 exemplarily presents the distribution of the Precision@5 and MAP scores for PPR+ with ten known 
names, showing that the low scores are due to a highly skewed distribution where for most users no or only few 
relevant names are predicted. This, in turn, emphasizes the importance of additional analysis besides 
calculating average prediction metrics, as the average over skewed distributed values is sensitive to outliers. Nevertheless, 
the considered performance metrics consistently assess the impact of an increasing number of known names for all 
considered recommendation systems. 

For a more formal analysis of the consistency among the considered performance metrics, we assessed the 
prediction accuracy of each considered recommendation setting (in total 1435 models) with every performance metric, 
respectively averaged over all users. We thus obtained for each metric a ranking on all models, among which we 
calculated Kendall's $\tau$ rank correlation coefficient. Table 5 shows the calculated correlation scores. Notably, all metrics but 
NDCG@10 show a very high correlation, supporting the observed consistent assessment of different recommendation 
systems with varying number of known names in Figure 4. 
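
For illustration, Kendall's rank correlation between the scores that two metrics assign to the same list of models can be computed as in the following Python sketch. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant was used for Table 5 is not stated here, so this choice is an assumption, as is the function name.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1      # both metrics order models i and j the same way
        elif sign < 0:
            discordant += 1      # the metrics disagree on this pair
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs
```

A value close to 1 means that two metrics rank the models almost identically, while a value around 0.35, as observed for NDCG@10, indicates only weak agreement.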

We now turn our focus to the discussion of the performance scores for the different recommendation systems in 
Figure 4. The most popular recommendation approach shows a relatively good performance which is, as expected, independent 
of the number k of known names. It is therefore a suitable baseline for assessing other recommendation systems' 
performance scores. All other considered methods benefit from increasing k. In particular, the item based collaborative 
filtering shows linearly increasing performance scores, while performing worse than the most popular 
baseline for lower k. Except for k < 2, the user based collaborative filtering approach shows the highest performance 
scores. 

As for PPR and PPR+, the former shows better performance for lower k (even outperforming user based 
collaborative filtering), whereas the latter steadily increases with the number of known names. Please note that this is in line 
with the intention behind the construction of PPR+: the plain preferential PageRank is strongly influenced by global 
frequencies, which are reduced by subtracting the global PageRank (cf. Section 3.5). As the popular (i. e., most frequent) 
names are already a relatively good recommendation, the influence of global frequencies is desirable if only few names 
of a user are known. 

Finally, WRMF also shows increasing prediction accuracy with an increasing number of known names. 

Figure 5: Distribution of the Precision@5 and MAP scores for PPR+ with ten known names. 

Table 5: Correlations between the different evaluation metrics averaged for each model separately. 

              Precision@5  Precision@10  Recall@5  Recall@10  NDCG@5  NDCG@10  MAP
Precision@5
Precision@10  0.94
Recall@5      0.89         0.87
Recall@10     0.85         0.88          0.94
NDCG@5        0.94         0.90          0.86      0.82
NDCG@10       0.37         0.38          0.34      0.34       0.36
MAP           0.79         0.79          0.79      0.79       0.81    0.35



LeaveKOut In this experimental setup, as much of the user's search history as possible is used for predicting the 
user's evaluation data Test(u). We may therefore derive from this experiment which recommendation system yields 
(on average) the best prediction accuracy in a running system where users have broadly varying search history sizes. 

Table 6 shows the performance scores for predicting ten randomly chosen names, based on the remaining search 
log. We apply the sign test (cf. Section 3.1) using the MAP scores to test the stated observations for significance and 
provide the correspondingly obtained p-values. Firstly, we note that the trends observed in the previous experiment 
are affirmed. In particular, PPR+ outperforms all but the user based collaborative filtering ($p < 10^{-3}$). In case of 
user based collaborative filtering, the average performance scores indicate that PPR+ yields better recommendations. 
But considering the per user performance, i. e., the number of users for which PPR+ yields better results, the sign test 
indicates that UCF performs best ($p < 10^{-3}$). 
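
The difference between comparing averages and comparing per user outcomes can be made concrete with a small sketch of a two-sided sign test. It is an illustration only: ties are dropped, the exact binomial tail is used, and the per user scores below are invented toy values, not results from our data.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-user scores.
    Returns (wins of A, wins of B, p-value); ties are ignored."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Hypothetical scores: system A has the higher mean, system B wins for most users.
system_a = [0.9, 0.0, 0.0, 0.0, 0.0, 0.0]
system_b = [0.1, 0.05, 0.05, 0.05, 0.05, 0.05]
mean_a = sum(system_a) / len(system_a)
mean_b = sum(system_b) / len(system_b)
wins_a, wins_b, p_value = sign_test(system_a, system_b)
```

In this toy case the mean favors system A while the sign counts favor system B, which mirrors the situation described above for PPR+ and UCF.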



TakeFirstIn/LeaveLastOut The previous experiments randomly selected names from a user's search history for 
predicting the user's remaining names. This approach has two advantages: Firstly, the effect of the ordering of the entered 
names is averaged out and the discussion is therefore focused on the influence of merely the size of the set of known names. 
Secondly, the repeated randomization smooths the results, thus reducing the effect of the relatively small population size 
(in the considered setup only 1,230 users). 

Nevertheless, the order of input is the one which a recommendation system within a live setting has to deal with. 
We therefore also consider the TakeFirstIn/LeaveLastOut evaluation protocol, where the chronological order of the 
search history is used for splitting the evaluation data (cf. Section 3.3 for more details). 

Figure 6 shows Precision@5, NDCG@5 and MAP for the TakeFirstIn experiment. Firstly, the result plots are less 
smooth than the corresponding plots for the TakeKIn experiments, as the additional smoothing induced by repeated 
randomization is missing. Secondly, the relative assessment of UCF, ICF, MP, RND and WRMF is consistent with 
that of the randomized TakeKIn experiments. Most interestingly, PPR performs better than PPR+ for all k (in 
contrast to the results of the TakeKIn experiments). This shows that the order of input does 
indeed matter. The worse assessment of PPR+ in the chronological order hints at the impact of global frequencies 
(i. e., popularity of names), as by construction, PPR+ increases personalization by reducing the influence of global 
frequencies. The results in Figure 6 indicate that the first entered names tend to be more popular names and that the later a 
name is entered, the more specific the name is for the user. To underpin this assumption, we calculated for each name 
its popularity rank, that is, its position within the list of names ordered decreasingly by search frequency. We 
then calculated for each chronological position p the average popularity rank of all names which were searched by some 
user at position p within the user's search history. Figure 7 shows a positive correlation between the position in the 
user's search history and the popularity rank of the corresponding name, indeed indicating that users tend to search 
first for popular names. 

Table 6: Leave-10-Out evaluation for all considered recommendation systems. 

Figure 6: Prediction accuracy of the different recommendation systems, relative to varying number of known names 
per user. 
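
The popularity analysis described above can be sketched in Python as follows. The helper names and the three toy search histories are hypothetical; ties in frequency are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def popularity_ranks(histories):
    """Rank 1 is the most frequently searched name."""
    freq = Counter(name for history in histories for name in history)
    ordered = sorted(freq, key=lambda n: (-freq[n], n))
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def average_rank_by_position(histories):
    """Average popularity rank of the names entered at each chronological position."""
    ranks = popularity_ranks(histories)
    sums, counts = defaultdict(float), defaultdict(int)
    for history in histories:
        for pos, name in enumerate(history, start=1):
            sums[pos] += ranks[name]
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}


toy_histories = [["emma", "anna", "ylvi"], ["emma", "lena"], ["anna", "emma"]]
by_position = average_rank_by_position(toy_histories)
```

An average rank that grows with the position, as in this toy example, corresponds to users searching for popular names first.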

For reference, Table 7 also summarizes the prediction accuracy of the different recommendation systems for the 
LeaveLastOut protocol, which uses as many of the chronologically ordered names from the user's search history as 
possible for predicting the remaining (last) names. The indicated performance scores support the tendencies observed in the 
preceding TakeFirstIn experiment. 

For further supporting the observed characterizations of the different recommendation systems with respect to 
prediction accuracy, we applied the same evaluation protocol to the set Favorite(u) of favorite names per user u. As the 
corresponding evaluation data is much sparser than the set Enter(u) of entered names, we only show the results for the 
LeaveKOut protocol with k = 5, considering only users with at least eight favorite names (resulting in 230 cases). 
Figure 8 shows the corresponding results for Precision@5, NDCG@5 and MAP for all considered recommendation 
systems. The overall relative assessment of the different systems is consistent with the results obtained on 
the set of entered names, except that PPR+ now even outperforms UCF significantly (p = 0.0362). 

3.7 Discussion 

In this section we presented results of an experimental setup which aims at assessing the prediction accuracy of 
different recommendation systems. Besides establishing baseline results for reference, the presented evaluation guides 
the decision process of choosing a recommendation system for integration in a running application like Nameling. 




[Figure 7 plot; x-axis: Position in User's Search History] 

Figure 7: Average popularity rank for all entered names relative to the corresponding position within the users' search 
history, indicating that users tend to search first for popular names. To rule out statistical effects, we also shuffled each user's 
search history and depict the corresponding results in grey. 

Table 7: Leave-Last-Out evaluation for all considered recommendation systems. 

Figure 8: Average performance metrics for the favorite names (Leave-5-Out with 25 repetitions and a minimum of 8 
favorite names per user). 

Considering just the prediction accuracy, the standard preferential PageRank consistently shows the best performance 
for users for which only one or two names are known. In any other case, the user based collaborative filtering approach 
outperforms all other systems. As it is important to generate good recommendations as soon as possible to catch the 
user's interest, it is worthwhile to combine these approaches: recommending popular names to new users, PPR based 
recommendations for users with a small search history, and UCF otherwise. 
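
Such a combination amounts to a simple dispatch on the number of known names. The following sketch only illustrates the idea; the function name is hypothetical and the threshold of two names is taken from the observation above rather than from a tuned system.

```python
def choose_recommender(n_known, small_history=2):
    """Pick a recommender by the number of names known for the user."""
    if n_known == 0:
        return "MostPopular"   # new user: no personal signal available yet
    if n_known <= small_history:
        return "PPR"           # one or two known names
    return "UCF"               # larger search history
```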

But besides prediction accuracy, run time efficiency, among others, is an important issue for recommendation 
systems. In the context of tag recommendation systems it was observed that, in a live setting, many systems fail to provide 
recommendations within a justifiable time limit ifTTl . Also the handling of new items (i. e., names) and users has to be 
taken into account. In the case of name recommendations, the set of names is nearly constant over time, while there 
is a steady stream of new users and only very few users use the system for a long period of 
time. This is due to the fact that the need for a given name typically ends after a short time. 

Considering the run time efficiency, WRMF is a candidate due to the constant time operations with the latent user 
and item vectors. But PPR and PPR+ are also efficient to implement, as the results are just averaged over the individual 
query names of the user's search history. This operation can already be implemented within a database system which 
only holds the top similar names PPR+(i) and PPR(i), respectively, for every name i. As the set of names is nearly 
constant over time, good recommendations for new users can be produced from these precomputed vectors. 
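
A lookup based variant of this idea can be sketched as follows. The dictionary `top_similar` stands in for a database table that stores, for every name, a truncated list of its most similar names with their precomputed scores; the names, scores and function name are hypothetical, and truncating the lists only approximates averaging the full vectors.

```python
def recommend_from_precomputed(known, top_similar, k):
    """Average the precomputed per-name scores over the user's known names."""
    totals = {}
    for name in known:
        for other, score in top_similar.get(name, []):
            totals[other] = totals.get(other, 0.0) + score
    averaged = {n: s / len(known) for n, s in totals.items() if n not in known}
    return sorted(averaged, key=lambda n: (-averaged[n], n))[:k]


top_similar = {
    "emma": [("anna", 0.4), ("lena", 0.3)],
    "anna": [("emma", 0.5), ("lena", 0.2), ("mia", 0.1)],
}
suggestions = recommend_from_precomputed(["emma", "anna"], top_similar, 2)
```

No PageRank computation is needed at query time in this scheme, only one lookup per known name.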

Considering the results obtained on the chronologically ordered search history of users, PPR yields results at 
least as good as PPR+, as depicted in Figure 6. But the results obtained on randomized samples of the user's search 
history show significantly better performance of PPR+. The observations in Section 3.6 indicate that the difference in 
the prediction accuracy of PPR and PPR+ between the chronological and randomized evaluation can be explained 
by the tendency of users to search first for popular names and later for more special names. But these search 
habits might change with the availability of good recommendation systems. An important reason for the popularity 
correlated search order is that popular names are better known, so a user first thinks of popular names 
while browsing the system for more suitable names. A good recommendation system will help the user in accessing 
suitable (i. e., personalized) names more directly. 

Summing up, our adaptation PPR+ of the preferential PageRank is the most promising candidate for implementation 
in a running application. The choice of PPR+ is further supported by considering the actual favorite names in 
Nameling, where it even significantly outperforms UCF (p = 0.04). 



4 Diversification of PageRank based Recommendations 

The apparent diversity-accuracy dilemma of recommender systems is especially relevant for the task of recommending 
names, as future parents are often interested in names they like, but which should not be too common within their 
environment. Concerning prediction accuracy, the simple recommender which constantly recommends only the popular 
names already yields reasonable results (cf. Section 3.6), but is useless for the parents' need to discover names which are 
suitable for their children but are yet unknown to them. 

So far, we only considered the prediction accuracy for assessing the quality of the different recommendation 
approaches. This section aims at comparing the recommendation systems with respect to the diversity of the obtained 
recommendations. For this, we firstly fix our notion of diversity and corresponding evaluation measures and 
subsequently use these measures for comparing the diversity of the most promising recommendation approaches from 
Section 3. 

Finally, we consider different ways for increasing diversity of our PPR+ approach by incorporating background 
information based on networks of given names obtained from Wikipedia, Twitter and Nameling. 

4.1 Assessment of Diversity 

Diversity of a set of recommended items can either be assessed relative to semantic properties (such as gender or 
cultural context in the case of given names) or relative to its adaptation to the user's personal taste (i. e., how diverse 
recommendations for different users are). 

We focus on the aspect of adaptation, because we distinguish between the tasks of ranking and recommendation 
in the context of our work: the former targets the use case of searching and browsing for given names and 
accordingly benefits in particular from semantic result diversification, whereas the latter targets the use case of 
providing a small set of suitable names which adapts primarily to the user's personal taste. 

Assuming diversity of the users' personal taste, personalization of a given recommendation system Rec can be 
assessed by measuring the inter-user diversity for different users u and v. Accordingly, we consider the personalization 
index h proposed in [43], 

\[ h(u, v) := 1 - \frac{|\mathrm{Rec}_k(u) \cap \mathrm{Rec}_k(v)|}{k}, \]

where $\mathrm{Rec}_k(u)$ denotes Rec's top k recommended items for user u. Identical results for u and v thus yield a 
value of h(u, v) = 0, whereas completely different recommendations yield a value of h(u, v) = 1. The overall 
personalization performance h(Rec) is assessed by averaging over all pairs of users. 
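
A minimal Python sketch of this index is given below. It assumes that the overlap of the two top k lists is normalized by k, so that the index lies between 0 and 1; the function names and the toy recommendation lists are hypothetical.

```python
from itertools import combinations


def personalization(rec_u, rec_v, k):
    """h(u, v) = 1 - |top-k(u) & top-k(v)| / k."""
    return 1.0 - len(set(rec_u[:k]) & set(rec_v[:k])) / k


def overall_personalization(recs, k):
    """Average of h over all unordered pairs of users."""
    pairs = list(combinations(recs.values(), 2))
    return sum(personalization(a, b, k) for a, b in pairs) / len(pairs)


toy_recs = {"u": ["a", "b"], "v": ["a", "c"], "w": ["d", "e"]}
h_toy = overall_personalization(toy_recs, 2)
```

A recommender that returns the same popular names to everybody obtains a value of 0 under this measure.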

For assessing the prediction accuracy, we focus on the precision scores. Please note that in this section, the impact 
of varying number of known names on the prediction performance is not in the center of interest and therefore precision 
scores are calculated for each user on all but the known training items. Accordingly, the precision scores decrease with 
increasing number of known names, as at the same time the size of the corresponding test sets decreases. 

Fig. 9 shows personalization and precision scores for the most promising recommendation systems from Sec. 3.6, 
namely user based collaborative filtering, weighted matrix factorization, PPR and PPR+. For user based collaborative 
filtering and PPR, the personalization scores indicate the lowest diversity of the recommendation results, where the 
scores decrease with increasing number of known names. In contrast, the results obtained from WRMF and PPR+ stay 
at a respectively constant level for the varying number of known names, with PPR+ achieving the highest scores. The 
low personalization score for PPR can be explained by the dominance of global frequencies, i. e., popular names. 

Figure 9: Personalization as measure for assessing diversity and precision applied to the most promising 
recommendation systems from Sec. 3.5. 



4.2 Multi-Graph Aggregation for Increased Diversity 

With respect to prediction accuracy and both considered measures for diversity, our PPR+ approach already shows a 
superior trade-off compared to the other considered recommendation systems. Nevertheless, especially in the context 
of given names, diversity of recommendation results is important, as future parents are using such a system in order to 
find names which they like but were not aware of so far. 

In the following we present and compare different approaches for increasing the diversity of PPR+. The evaluation 
of diversity in an offline setting is very limited, as uncommon items (i. e., names with a high self-information score) 
are unlikely to be contained in a user's test set. Also, diversity by itself can be maximized by randomization, but at 
the same time, the results will show very low accuracy with respect to the user's personal taste. 

We therefore compare the different diversification approaches by considering the trade-off between prediction 
accuracy (in terms of precision) and diversity (in terms of personalization). All proposed diversification approaches 
encompass a diversification parameter $\eta$, which controls the impact of a second network of given names, intended 
for introducing diversity to the training data. Formally, we assume a set of graphs $G_1, \ldots, G_L$ with 
$G_\ell =: (V_\ell, E_\ell)$ over a common vertex set V and edge weighting functions $w_1, \ldots, w_L$ for $\ell = 1, \ldots, L$. 



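Weighted Average As a baseline approach for combining ranking results obtained from different networks, we 
consider the weighted average ranking AveRank, which is defined for given query items $I \subseteq V$ by 

\[ \mathrm{AveRank}(I) := \sum_{\ell=1}^{L} \eta_\ell \, \mathrm{PPR+}_\ell(I) \quad \text{with } \eta_\ell \in [0, 1] \text{ and } \sum_{\ell=1}^{L} \eta_\ell = 1, \]

where $\mathrm{PPR+}_\ell(I)$ denotes PPR+(I) calculated on $G_\ell$. 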
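
As a small illustration of this baseline, the following Python sketch combines score vectors that were computed separately on each network. The function name `ave_rank` and the toy vectors are hypothetical; the score vectors are assumed to be indexed by the same names in the same order.

```python
def ave_rank(score_vectors, etas):
    """Weighted average of per-network score vectors with weights summing to one."""
    if any(e < 0 or e > 1 for e in etas) or abs(sum(etas) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to one")
    m = len(score_vectors[0])
    return [sum(e * v[i] for e, v in zip(etas, score_vectors)) for i in range(m)]


# Two hypothetical PPR+ vectors over the same two names, weighted 0.75 / 0.25.
combined = ave_rank([[0.2, -0.2], [0.0, 0.4]], [0.75, 0.25])
```

Setting the weight of the second network to zero recovers the plain PPR+ ranking on the first network.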

We now look for ways of combining the graphs by themselves in order to benefit from mutual reinforcement of 
important nodes during the convergence process of the PPR+ algorithm. Various ways for combining different graphs 
exist, of which we consider the following two. 



Conditional Multigraph PageRank The conditional multigraph PageRank is based on the idea of a multigraph 
built from $G_1, \ldots, G_L$, in which edges are additionally labeled according to the graph they originated from. 
Accordingly, multiple edges among vertex pairs may exist, as depicted in Fig. 10b. Applying the intuition of the 
random surfer model, the basic idea of the conditional multigraph PageRank is that a random surfer reaching node 
u via a certain edge type will more likely leave u using an edge of the same type, but may nevertheless leave u 
using any other incident edge. Thus, the importance of each graph's neighborhood is accentuated, but 
interrelations with other nodes as induced by other graphs are also considered. Furthermore, the transition probability 
of an edge (u, v) scales with the number of graphs containing it. By introducing according stochastic (i. e., rows and 
columns sum up to one) damping factors $(\eta_{\ell_1,\ell_2}) \in [0, 1]^{L \times L}$, the overall inter-graph transition 
probabilities can be controlled. 

In order to apply standard PageRank calculations, we build a combined (non-multi) graph $\hat{G} = (\hat{V}, \hat{E})$, which 
incorporates the conditional multigraph approach by construction. For this, we firstly duplicate each node $v \in V$ 
according to the number of graphs, i. e., $\hat{V} := \{v^1, \ldots, v^L \mid v \in V\}$. Secondly, we include each graph $G_\ell$'s 
connectivity within $\hat{E}$, resulting in $E_{\ell,\ell} := \{(u^\ell, v^\ell) \in \hat{V} \times \hat{V} \mid (u, v) \in E_\ell\}$. Finally, for 
$\ell_1 \neq \ell_2$, we set $E_{\ell_1,\ell_2} := \{(u^{\ell_1}, v^{\ell_2}) \mid (u, v) \in E_{\ell_2}\}$ and 
$\hat{E} := \bigcup_{\ell_1,\ell_2} E_{\ell_1,\ell_2}$. Edge weights are given by the combined weighting 
function $w(u^{\ell_1}, v^{\ell_2}) := w_{\ell_2}(u, v)$. Accordingly, $\hat{G}$'s adjacency matrix $\hat{A}$ is block-wise structured 
into sub-matrices $A^{\ell_1,\ell_2}$, where each $n \times n$ sub-matrix $A^{\ell,\ell}$ corresponds to graph $G_\ell$'s adjacency matrix 
and $A^{\ell_1,\ell_2}$ describes transitions from graph $\ell_1$ to graph $\ell_2$. For calculating PageRank scores on $\hat{G}$, its 
adjacency matrix $\hat{A}$ has to be normalized in order to be column-stochastic, i. e., all columns sum up to one. The 
accentuation of inner-graph connectivity is realized by introducing damping factors $\eta_{\ell_1,\ell_2}$ on the graph 
normalization, such that each of $A^{\ell_1,\ell_2}$'s columns sums up to $\eta_{\ell_1,\ell_2}$. 
For obtaining the conditional multigraph PageRank score for a given I, we calculate MultiRank(I) := PPR+(I) on 
$\hat{G}$. 
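
The block-wise normalization can be illustrated with a short Python sketch that builds a combined column stochastic matrix from L weight matrices over the same nodes. Several points are assumptions of this sketch and not taken from our implementation: entry `[r][c]` is read as the weight of the edge from node c to node r, the block for transitions from graph l1 to graph l2 is placed in row block l2 and column block l1, and columns are re-normalized at the end to cover nodes without edges in some graph. The function names and toy graphs are hypothetical.

```python
def normalize_columns(W):
    """Scale every non-empty column of W to sum to one."""
    n = len(W)
    sums = [sum(W[r][c] for r in range(n)) for c in range(n)]
    return [[W[r][c] / sums[c] if sums[c] > 0 else 0.0 for c in range(n)]
            for r in range(n)]


def conditional_multigraph_matrix(graphs, eta):
    """Combined matrix over L copies of the n nodes. The block for transitions
    from layer l1 to layer l2 holds graph l2's normalized adjacency matrix
    scaled by eta[l1][l2], so each of its columns sums to eta[l1][l2]."""
    L, n = len(graphs), len(graphs[0])
    norm = [normalize_columns(W) for W in graphs]
    size = L * n
    A = [[0.0] * size for _ in range(size)]
    for l1 in range(L):
        for l2 in range(L):
            for r in range(n):
                for c in range(n):
                    A[l2 * n + r][l1 * n + c] = eta[l1][l2] * norm[l2][r][c]
    for c in range(size):                      # guard for dangling node copies
        s = sum(A[r][c] for r in range(size))
        if s > 0:
            for r in range(size):
                A[r][c] /= s
    return A


W1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
W2 = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
eta = [[0.8, 0.2], [0.2, 0.8]]               # rows and columns sum to one
combined_matrix = conditional_multigraph_matrix([W1, W2], eta)
```

With the diagonal entries of eta close to one, a random surfer mostly stays within the graph through which the current node was reached, which is the accentuation of inner-graph connectivity described above.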



Parallel PageRank A simpler combination of $G_1, \ldots, G_L$ can be achieved by interconnecting each node $v \in V$ 
across all graphs. Keeping the notation introduced for the conditional multigraph PageRank, we build an according 
graph $\hat{G} = (\hat{V}, \hat{E})$ by altering only the inter-graph connectivity to 
$E_{\ell_1,\ell_2} := \{(u^{\ell_1}, u^{\ell_2}) \mid u \in V\}$. For obtaining the 
parallel multigraph PageRank score for a given I, we calculate PRank(I) := PPR+(I) on $\hat{G}$. 



Weighting Please note that the prediction performance of PPR+ depends on the edge weighting of the input graph. 
Accordingly, the weights of inter-network edges may be adapted in order to improve one of the evaluation targets (e. g., 
depending on the incident node's connectivity). For our evaluation, we set the weight of every such edge $(u^{\ell_1}, v^{\ell_2})$ 
with $\ell_1 \neq \ell_2$ to one in order to consider the most general case. 



4.3 Evaluation 

For evaluating the different result diversification approaches for PPR+, we consider the network of given names 
obtained from Nameling's usage data and combine it pairwise with the co-occurrence graphs obtained from Wikipedia 
and Twitter, respectively, as well as the shared category graph obtained from Wiktionary and the shared favorite graph 
from Nameling. We thereby apply the evaluation protocol from Sec. 3.6, except that all but the training names are used 
for evaluation. For considering the trade-off between prediction accuracy and diversity, we show the obtained results 
in a precision/diversity scatter plot. For reference, we also calculated the diversity of the considered user profiles by 
themselves (h = 0.99, I = 9.62, calculated on the first ten test items for k = 1, 5). 



Fig. 11 shows the accuracy/diversity scatter plots with respect to personalization on all considered pairs of 
networks. We only show results obtained for the TakeFirstIn protocol with k = 5. The results only differ in magnitude for 
varying k, but the overall characteristics are unchanged. We note that all considered multigraph approaches introduce 
varying amounts of diversity and an accordingly differing level of accuracy. We firstly consider the 
personalization/precision trade-off. The weighted average approach shows similar characteristics for all considered networks. 
Except for the combination with the Twitter based network, both combined multigraph approaches outperform the 
considered baseline, where PRank shows overall smoother transitions and covers broader ranges of the personalization 
metric. Nevertheless, MultiRank is the only approach which shows increasing tendencies for the precision scores 
with increasing personalization score. In combination with the Twitter based network, both combined multigraph 
approaches show performance inferior to the baseline model. 

[Figure 10 panels: (a) Three separate graphs, (b) The merged multigraph, (c) The combined graphs] 

Figure 10: Construction for the conditional multigraph PageRank: Separate graphs (a), the merged multigraph (b) 
and the combined graphs, where nodes are duplicated according to the number of networks (c). The construction for 
the conditional multigraph PageRank is shown on top and that for the parallel PageRank on the bottom right. 

[Figure 11 scatter plots; recoverable panel titles: Wikipedia @5, Wiktionary @5] 

Figure 11: Diversity vs. accuracy scatter plots for all considered multigraph approaches in the co-occurrence network 
obtained from Twitter (top row), the co-occurrence network obtained from Wikipedia (2nd row), the shared categories 
network (3rd row) and shared favorite names network (bottom row) for recommendations based on one, five or ten 
known names in the left, middle and right column respectively. 

Summing up, we observe differing characteristics for the accuracy/diversity dilemma, hinting at a more pronounced 
dependence of MultiRank on the inter-network correlations. Nevertheless, the results obtained from the different 
approaches are complementary, and ultimately only a live evaluation of the different graph combinations will reveal 
which result diversification approach corresponds best to the user's expectations. 



5 Related Work 

The present work introduces a new field of application for analyzing relations among named entities and recommen- 
dation systems. It is motivated by the work on the search engine "Nameling", which is presented in ESI . The ranking 
performance of the structural similarity metrics based on the Wikipedia corpus relative to the actual usage data which 
accrued in the running system is evaluated in [27, 28]. The considered approaches are based on work in the fields of 
distributional semantics, link prediction and (more generally) vertex similarity in graphs as well as recommendation 
systems. 

Distributional Semantics The field of distributional semantic relatedness has attracted a lot of attention in the literature 
during the past decades (see for a review). Several statistical measures for assessing the similarity of words have been 
proposed, as for example in [20, 11, 14, 18, 39]. Notably, first approaches for using Wikipedia as a source for 
discovering relatedness of concepts can be found in [2, 37, 8]. 

Vertex Similarity & Link Prediction The task of predicting (future) links is especially relevant for online social 
networks, where social interaction is significantly stimulated by suggesting people as contacts whom the user might 
know. From a methodological point of view, most approaches build on different similarity metrics on pairs of nodes 
within weighted or unweighted graphs [16, 19, 23, 24]. A good comparative evaluation of different similarity metrics 
is presented in [21]. The construction of multigraph extensions to the NameRank algorithm for controlled result 
diversification is based on [35], where random walks are used to combine the information of different networks. 
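
To make the notion of a similarity metric on pairs of nodes concrete, the following minimal sketch computes two widely used measures on a small graph: the Jaccard coefficient of the neighbor sets (unweighted view) and the cosine similarity of the weighted adjacency vectors. The graph, its edge weights and the function names are hypothetical and chosen for illustration only; the sketch does not reproduce the exact metrics evaluated in the cited works.

```python
import math

# Hypothetical weighted co-occurrence graph: node -> {neighbor: weight}.
graph = {
    "emma": {"anna": 2.0, "paul": 1.0},
    "anna": {"emma": 2.0, "paul": 3.0},
    "paul": {"emma": 1.0, "anna": 3.0},
    "max": {},
}


def jaccard(graph, u, v):
    """Jaccard coefficient of the neighbor sets of u and v (weights ignored)."""
    nu, nv = set(graph[u]), set(graph[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def cosine(graph, u, v):
    """Cosine similarity of the weighted adjacency vectors of u and v."""
    wu, wv = graph[u], graph[v]
    dot = sum(w * wv[n] for n, w in wu.items() if n in wv)
    norm_u = math.sqrt(sum(w * w for w in wu.values()))
    norm_v = math.sqrt(sum(w * w for w in wv.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    print(jaccard(graph, "emma", "anna"))
    print(cosine(graph, "emma", "anna"))
```

Both functions return 0.0 for isolated nodes, which avoids a division by zero; whether that convention is appropriate depends on the application.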

Recommender Systems Recommending given names is just a special case of the item recommendation task, which 
is extensively discussed for various fields of application, such as movies [10], tags [15] and products [22]. Notably, 
the characteristics of deriving recommendations from activity logs (rather than explicit user feedback) are discussed 
in [13], where the uncertainty of implicit feedback is taken into account by attaching an adequate level of confidence 
to an implicitly expressed interest in an item (e. g., by clicking on the description of a given name). A very good 
introduction and summary of the evaluation of recommendation systems can be found in [36]. 
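
The idea of attaching a confidence level to implicitly expressed interest can be sketched as follows. The sketch assumes a linear confidence of the form 1 + alpha * count, which is one common choice and is not stated in the text above; the function name, the log format and the value of alpha are likewise hypothetical.

```python
from collections import Counter


def confidence_weights(events, alpha=40.0):
    """Map each observed (user, item) pair to (preference, confidence).

    events: iterable of (user, item) tuples taken from an activity log.
    alpha:  hypothetical scaling factor (an assumption, to be tuned).
    Every observed pair gets preference 1.0; its confidence grows
    linearly with the number of observed interactions.
    """
    counts = Counter(events)
    return {pair: (1.0, 1.0 + alpha * n) for pair, n in counts.items()}


def lookup(weights, user, item):
    """Unobserved pairs get preference 0.0 with minimal confidence 1.0."""
    return weights.get((user, item), (0.0, 1.0))


if __name__ == "__main__":
    # Hypothetical click log: user u1 viewed "emma" twice, u2 viewed "paul" once.
    log = [("u1", "emma"), ("u1", "emma"), ("u2", "paul")]
    weights = confidence_weights(log)
    print(lookup(weights, "u1", "emma"))
    print(lookup(weights, "u2", "emma"))
```

Treating unobserved pairs as low-confidence negatives rather than as missing values is the design choice that distinguishes this kind of model from one trained on explicit ratings.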